feat(schema): represent, serialize and validate v3 column default values (1/4) #746

Open
huan233usc wants to merge 1 commit into apache:main from huan233usc:feat/default-values-schema
Conversation

@huan233usc huan233usc commented Jun 15, 2026

Part 1 of a multi-part split of #730 (column default values, item 2 of #637). The full
end-to-end implementation is in #731, kept open as the proof-of-concept; this series
lands it in reviewable pieces.

This PR is the schema foundation — representing, serializing and validating v3
column default values. It is purely additive and changes no read or write behavior on
its own.

What's in this PR

  • SchemaField carries initial-default / write-default, stored as
    std::shared_ptr<const Literal> (immutable payload shared across copies, like the
    adjacent type_; the C++ analog of Java's final Literal<?>). They are set via the
    constructor. Getters return std::optional<std::reference_wrapper<const Literal>> for
    reading (the Schema::FindFieldByName idiom); initial_default_ptr() /
    write_default_ptr() expose the shared pointer so a rebuilt field (e.g. ID
    reassignment) shares the value instead of copying it.
  • JSON serde: parse/write initial-default / write-default using the existing
    single-value serialization (all primitive types).
  • Schema::Validate: version-gates the initial-default to format v3
    (kMinFormatVersionDefaultValues) — it reinterprets how existing data files are read,
    so it requires the v3 reader contract. The write-default only affects values written
    going forward and is not version-gated (matching Java's Schema.checkCompatibility,
    which gates only the initial default). Both defaults are otherwise validated to be
    non-null primitive literals matching the field type.
  • Generic projection: a column missing from a data file with an initial-default
    maps to FieldProjection::Kind::kDefault carrying the literal (the per-format readers
    consume this in the follow-up PRs).
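
The version-gating rule in the Schema::Validate bullet can be sketched roughly as follows. This is a minimal, self-contained illustration and not the PR's code: `Literal`, `Field`, `CheckDefault` and `ValidateField` are hypothetical stand-ins, types are compared by name only, and the "primitive" check is omitted. Only the constant name `kMinFormatVersionDefaultValues` and the value 3 come from the description above.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <string>

// Hypothetical stand-ins for illustration; not the iceberg-cpp types.
struct Literal {
  std::string type_name;
  std::string value;
  bool is_null = false;
};

struct Field {
  std::string name;
  std::string type_name;
  std::shared_ptr<const Literal> initial_default;  // nullptr means "not set"
  std::shared_ptr<const Literal> write_default;    // nullptr means "not set"
};

constexpr int kMinFormatVersionDefaultValues = 3;

// Checks shared by both defaults: non-null literal whose type matches the field.
// Returns an error message, or std::nullopt when the default is acceptable.
std::optional<std::string> CheckDefault(const Field& field,
                                        const std::shared_ptr<const Literal>& value,
                                        const std::string& which) {
  if (!value) return std::nullopt;  // an unset default is always fine
  if (value->is_null) return which + " of '" + field.name + "' must be non-null";
  if (value->type_name != field.type_name) {
    return which + " of '" + field.name + "' does not match the field type";
  }
  return std::nullopt;
}

std::optional<std::string> ValidateField(const Field& field, int format_version) {
  // Only the initial default is version-gated: it changes how existing files are read.
  if (field.initial_default && format_version < kMinFormatVersionDefaultValues) {
    return "initial-default of '" + field.name + "' requires format v3";
  }
  if (auto error = CheckDefault(field, field.initial_default, "initial-default")) {
    return error;
  }
  // The write default is validated but not version-gated.
  return CheckDefault(field, field.write_default, "write-default");
}
```

Under these assumptions, a field with only a write-default passes at format version 2, while the same field with an initial-default is rejected until the version reaches 3.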

Follow-ups (stacked on this PR)

  • read path — Parquet (literal_util + parquet projection/materialization)
  • read path — Avro
  • schema evolution (UpdateSchema add/update column defaults)

Testing

Added tests

Comment on lines +148 to +149
std::shared_ptr<const Literal> initial_default_;
std::shared_ptr<const Literal> write_default_;

Contributor

ReassignField constructs a new SchemaField via the 5-argument constructor which initializes initial_default_ and write_default_ to nullptr. When schema IDs are reassigned (e.g., copying a schema with fresh IDs via the Schema(get_id) path), all default values on fields are silently lost. We should copy all field properties including initialDefault and writeDefault.

Author

Good catch, confirmed. Defaults are now constructor args, and ReassignField passes the source field's initial_default_ptr()/write_default_ptr() through, so they're shared with the reassigned field, not lost. Added ReassignIdsPreservesDefaultValues.
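
For readers following the thread, the fix described above could look roughly like this. It is a simplified sketch under stated assumptions, not the PR's implementation: `Field`, `Literal` and `ReassignId` are hypothetical stand-ins, and only the accessor names `initial_default_ptr()` / `write_default_ptr()` and the `optional<reference_wrapper>` getter shape are taken from the PR description.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>
#include <optional>
#include <string>
#include <utility>

struct Literal {  // hypothetical stand-in for the real Literal
  std::int64_t value;
};

class Field {
 public:
  // Defaults are constructor arguments, so a field is immutable once built.
  Field(std::int32_t id, std::string name,
        std::shared_ptr<const Literal> initial_default = nullptr,
        std::shared_ptr<const Literal> write_default = nullptr)
      : id_(id),
        name_(std::move(name)),
        initial_default_(std::move(initial_default)),
        write_default_(std::move(write_default)) {}

  std::int32_t id() const { return id_; }
  const std::string& name() const { return name_; }

  // Read accessor: a non-owning view, empty when no default is set.
  std::optional<std::reference_wrapper<const Literal>> initial_default() const {
    if (!initial_default_) return std::nullopt;
    return std::cref(*initial_default_);
  }

  // Pointer accessors: let a rebuilt field share the payload instead of copying it.
  const std::shared_ptr<const Literal>& initial_default_ptr() const {
    return initial_default_;
  }
  const std::shared_ptr<const Literal>& write_default_ptr() const {
    return write_default_;
  }

 private:
  std::int32_t id_;
  std::string name_;
  std::shared_ptr<const Literal> initial_default_;
  std::shared_ptr<const Literal> write_default_;
};

// Rebuilds a field with a fresh ID while passing both defaults through.
Field ReassignId(const Field& source, std::int32_t new_id) {
  return Field(new_id, source.name(), source.initial_default_ptr(),
               source.write_default_ptr());
}
```

Because the literal is held as a `std::shared_ptr<const Literal>`, the reassigned field points at the same immutable object as the source, so nothing is copied and nothing can be mutated through either field.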

Comment thread src/iceberg/json_serde.cc
Comment on lines +571 to +580
if (initial_default_json.has_value()) {
  ICEBERG_ASSIGN_OR_RAISE(Literal literal,
                          LiteralFromJson(*initial_default_json, field.type().get()));
  field = field.WithInitialDefault(std::move(literal));
}
if (write_default_json.has_value()) {
  ICEBERG_ASSIGN_OR_RAISE(Literal literal,
                          LiteralFromJson(*write_default_json, field.type().get()));
  field = field.WithWriteDefault(std::move(literal));
}

Contributor

The deserialization first constructs a bare SchemaField, then conditionally calls WithInitialDefault/WithWriteDefault, each of which copies the entire field (including the shared_ptr<Type>). This is an unnecessary intermediate copy.

Author

Agreed — FieldFromJson now parses the defaults first and builds the field in one construction. Intermediate copy gone.
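
The "parse first, construct once" shape mentioned in this reply might be sketched as below. Everything here is an assumption made for illustration: `FakeJson` is a plain string map standing in for a real JSON object, and `Literal`, `Field`, `ParseOptionalDefault` and `FieldFromFakeJson` are hypothetical names, not the functions in json_serde.cc.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for a JSON object: key -> raw text.
using FakeJson = std::map<std::string, std::string>;

struct Literal {  // hypothetical stand-in for the real Literal
  std::string text;
};

struct Field {
  int id;
  std::string name;
  std::shared_ptr<const Literal> initial_default;
  std::shared_ptr<const Literal> write_default;
};

// Returns nullptr when the key is absent, otherwise a shared immutable literal.
std::shared_ptr<const Literal> ParseOptionalDefault(const FakeJson& json,
                                                    const std::string& key) {
  auto it = json.find(key);
  if (it == json.end()) return nullptr;
  return std::make_shared<const Literal>(Literal{it->second});
}

// Parses every piece up front, then builds the field exactly once.
// Note: std::stoi throws on non-numeric input; a real parser would report an error.
std::optional<Field> FieldFromFakeJson(const FakeJson& json) {
  auto id = json.find("id");
  auto name = json.find("name");
  if (id == json.end() || name == json.end()) return std::nullopt;

  auto initial_default = ParseOptionalDefault(json, "initial-default");
  auto write_default = ParseOptionalDefault(json, "write-default");

  return Field{std::stoi(id->second), name->second, std::move(initial_default),
               std::move(write_default)};
}
```

Compared with building a bare field and then calling a modifier per default, this form never creates an intermediate field that is immediately discarded.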

Comment thread src/iceberg/schema_field.cc Outdated
Comment on lines +76 to +80
SchemaField SchemaField::WithInitialDefault(Literal initial_default) const {
  SchemaField copy = *this;
  copy.initial_default_ = std::make_shared<const Literal>(std::move(initial_default));
  return copy;
}

Contributor

I don't think we need to copy the whole SchemaField; can we just set the initial_default_ field and return *this?
The same applies to the following With methods.

Author

Agreed — moved defaults into the constructor and removed both With... methods, so construction no longer copies the field.

@huan233usc huan233usc force-pushed the feat/default-values-schema branch 3 times, most recently from 1ee5b32 to 34470af Compare June 16, 2026 05:30
@huan233usc huan233usc requested a review from WZhuo June 16, 2026 05:38

@WZhuo WZhuo left a comment

Contributor

LGTM

Copilot AI left a comment

Pull request overview

Adds Iceberg v3 column default value support at the schema layer by carrying, JSON-(de)serializing, validating, and projecting initial-default / write-default literals (foundation for later read/write-path work).

Changes:

  • Extend SchemaField to store initial-default / write-default as shared immutable literals and include them in equality/ID-reassignment rebuilds.
  • Add JSON serde for initial-default / write-default using existing single-value literal serialization.
  • Update schema projection to use FieldProjection::Kind::kDefault when an expected field is missing but has initial-default, and add/extend unit tests + v3 metadata fixture.
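
The projection change in the last bullet can be illustrated with a small sketch. It assumes, beyond what this page states, that a missing optional field without a default projects as null and a missing required field without a default is an error; `ExpectedField`, `Projection`, `Kind` and `ProjectField` are hypothetical stand-ins for the real FieldProjection machinery.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <optional>
#include <string>
#include <vector>

struct Literal {  // hypothetical stand-in for the real Literal
  std::string text;
};

struct ExpectedField {
  int id;
  bool optional;
  std::shared_ptr<const Literal> initial_default;  // nullptr means "not set"
};

enum class Kind { kProjected, kDefault, kNull };

struct Projection {
  Kind kind;
  std::size_t source_index = 0;                  // meaningful only for kProjected
  std::shared_ptr<const Literal> default_value;  // meaningful only for kDefault
};

// Resolves one expected field against the field IDs present in a data file.
// Returns std::nullopt when a required field is missing and has no default.
std::optional<Projection> ProjectField(const ExpectedField& expected,
                                       const std::vector<int>& file_field_ids) {
  for (std::size_t i = 0; i < file_field_ids.size(); ++i) {
    if (file_field_ids[i] == expected.id) {
      // Present in the file: the default is ignored.
      return Projection{Kind::kProjected, i, nullptr};
    }
  }
  if (expected.initial_default) {
    // Missing from the file: carry the literal so a reader can materialize it.
    return Projection{Kind::kDefault, 0, expected.initial_default};
  }
  if (expected.optional) return Projection{Kind::kNull, 0, nullptr};
  return std::nullopt;
}
```

The initial-default check runs before the optional/required fallback, so a missing required column with a default resolves to the default rather than an error.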

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
| --- | --- |
| src/iceberg/test/schema_util_test.cc | Adds projection tests for missing fields with initial-default and for ignoring defaults when the field is present. |
| src/iceberg/test/schema_test.cc | Adds schema validation/version-gating and ID-reassignment preservation tests for default values. |
| src/iceberg/test/schema_json_test.cc | Adds round-trip and failure tests for JSON serialization/parsing of default values (incl. nested structs). |
| src/iceberg/test/resources/TableMetadataV3Valid.json | Adds a v3-valid table metadata JSON fixture. |
| src/iceberg/schema.cc | Preserves defaults during ID reassignment and adds default-related validation in Schema::Validate. |
| src/iceberg/schema_util.cc | Projects missing fields with initial-default as kDefault rather than error/null. |
| src/iceberg/schema_field.h | Extends SchemaField API/storage to carry default literals and expose them via optional reference accessors. |
| src/iceberg/schema_field.cc | Implements default accessors, validation of defaults, and includes defaults in equality. |
| src/iceberg/json_serde.cc | Serializes/parses initial-default / write-default on schema fields via literal single-value serde. |

Comment thread src/iceberg/schema_field.cc
Comment thread src/iceberg/schema.cc
First of a multi-part split of column default value support (apache#730) — the
schema foundation the read and evolution paths build on. Purely additive;
no read/write behavior change on its own.

- SchemaField carries `initial-default` / `write-default` (immutable
  std::shared_ptr<const Literal>) with copy-preserving WithInitialDefault /
  WithWriteDefault modifiers; getters return optional<reference_wrapper>.
- JSON serde reads/writes `initial-default` / `write-default` via the
  existing single-value serialization.
- Schema::Validate rejects default values below format v3 and validates
  they are non-null primitive literals matching the field type.
- Generic schema projection maps a column missing from a data file with an
  initial-default to FieldProjection::Kind::kDefault.

Read-path application (Parquet/Avro) and schema evolution follow in separate
PRs. See apache#731 for the full end-to-end proof-of-concept.
@huan233usc huan233usc force-pushed the feat/default-values-schema branch from 34470af to f663e0e Compare June 18, 2026 17:21

3 participants